In [ ]:
The file "../data/CH4.pdb" contains the coordinates of a methane molecule in PDB format. The file consists of a header followed by record lines, each of which contains the following fields:
record name (= ATOM), atom serial number, atom name, x-, y-, z-coordinates, occupancy, and temperature factor,
e.g.
ATOM 2 H -0.627 -0.627 0.627 0.00 0.00
Convert the file into XYZ format: the first line contains the number of atoms, the second line is a title string, and the following lines contain the atomic symbols and the x-, y-, z-coordinates, all separated by whitespace. Write the coordinates with 6 decimals:
5
Converted from PDB
C 0.000000 0.000000 0.000000
...
For now, focus only on printing the output. Writing to a file comes next.
Hints:
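One possible approach is sketched below on in-memory sample data rather than the real file. The sample lines, the `pdb_to_xyz` name, and the assumption that fields are separated by whitespace (as in the example record above) are illustrative only; the actual "../data/CH4.pdb" may have a different header, and strict PDB files use fixed columns.

```python
# Hypothetical sample in the field layout described above;
# the real ../data/CH4.pdb may have a different header and more atoms.
sample_pdb = """\
HEADER    sample data, not the real file
ATOM 1 C 0.000 0.000 0.000 0.00 0.00
ATOM 2 H -0.627 -0.627 0.627 0.00 0.00
"""


def pdb_to_xyz(text, title="Converted from PDB"):
    """Return XYZ-formatted text for the ATOM records found in `text`."""
    atoms = []
    for line in text.splitlines():
        fields = line.split()
        # Skip the header and any other line that is not an ATOM record
        if not fields or fields[0] != "ATOM":
            continue
        symbol = fields[2]
        x, y, z = (float(value) for value in fields[3:6])
        # Format each coordinate with 6 decimals
        atoms.append(f"{symbol} {x:.6f} {y:.6f} {z:.6f}")
    return "\n".join([str(len(atoms)), title] + atoms)


xyz_text = pdb_to_xyz(sample_pdb)
print(xyz_text)
```

With the real file, the same function could be fed the result of `open(...).read()`.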
In [ ]:
In [ ]:
Many data exchange formats are so-called delimiter-separated values formats. The best known of these is CSV (comma-separated values).
The format has several caveats. For example, many European locales use the comma (,) as the decimal separator and the semicolon (;) as the field separator, while English-language locales typically use the dot (.) as the decimal separator and the comma (,) as the field separator.
Another family of formats uses whitespace, such as space or tab characters, to separate fields.
Python's csv library supports most of the variation between these formats, and it can be a time-saving tool for those who use Python and deal with file formats a lot.
The file "../data/iris.data" is actually in CSV format even though the file extension doesn't explicitly say so (this is common).
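As a small illustration of the delimiter variation described above, the sketch below reads semicolon-separated data with `csv.reader`. The sample data and column names are made up. Note that the csv module only splits fields; converting decimal commas to floats is done by hand here.

```python
import csv
import io

# Hypothetical in-memory data: semicolon as field separator, comma as decimal separator
european = io.StringIO("name;length\nsample_a;5,1\nsample_b;4,9\n")

reader = csv.reader(european, delimiter=";")
header = next(reader)  # first row holds the column labels

# csv yields strings, so replace the decimal comma before calling float()
rows = [[name, float(length.replace(",", "."))] for name, length in reader]

print(header)
print(rows)
```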
Read in iris.data and write out a tab-separated file "iris.tsv" using the csv module.
Hint: because the first line of the input file has labels, csv.DictReader and csv.DictWriter are a good choice.
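A minimal sketch of the DictReader/DictWriter pattern follows, using in-memory text in place of the real files. The column names and rows are assumptions for illustration, not the actual contents of "../data/iris.data".

```python
import csv
import io

# Hypothetical stand-in for ../data/iris.data; the column names are assumptions
source = io.StringIO(
    "sepal_length,sepal_width,species\n"
    "5.1,3.5,Iris-setosa\n"
    "7.0,3.2,Iris-versicolor\n"
)
target = io.StringIO()  # stands in for the output file "iris.tsv"

reader = csv.DictReader(source)  # takes field names from the first line
writer = csv.DictWriter(
    target,
    fieldnames=reader.fieldnames,
    delimiter="\t",        # tab-separated output
    lineterminator="\n",
)
writer.writeheader()
for row in reader:
    writer.writerow(row)

tsv_text = target.getvalue()
print(tsv_text)
```

When working with real files, the csv documentation recommends opening them with `newline=""`, for example `open("iris.tsv", "w", newline="")`.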
In [ ]:
The file "../data/word_count.txt" contains a short piece of text. Determine the frequency of words in the file, i.e. how many times each word appears. Print out the ten most frequent words.
Read the file line by line and use the split() method to separate each line into words. The frequencies are most conveniently stored in a dictionary. The dictionary method setdefault can be useful here.
For sorting, convert the dictionary into a list of (key, value) pairs with the items() method:
words = {"foo" : 1, "bar" : 2}
print(list(words.items()))
[('foo', 1), ('bar', 2)]
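The counting and sorting steps could look roughly like this. The sample text is made up and stands in for "../data/word_count.txt"; with the real file, one would loop over the open file object instead of `splitlines()`.

```python
# Hypothetical sample text standing in for ../data/word_count.txt
sample_text = """the quick brown fox
jumps over the lazy dog
the dog barks"""

counts = {}
for line in sample_text.splitlines():
    for word in line.split():
        # setdefault returns the current count, inserting 0 for unseen words
        counts[word] = counts.setdefault(word, 0) + 1

# Sort the (word, count) pairs by count, largest first
ranked = sorted(counts.items(), key=lambda pair: pair[1], reverse=True)

for word, n in ranked[:3]:
    print(word, n)
```

This sketch treats words as case-sensitive and leaves punctuation attached; whether to normalize them is a choice left to the exercise.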
In [ ]:
FASTA is a file format for storing nucleotide or protein sequences. Each sequence consists of a header line starting with >, followed by one or more lines containing the nucleotides or amino acids of the sequence, represented by single-letter codes:
>5IRE:A|PDBID|CHAIN|SEQUENCE
IRCIGVSNRDFVEGMSGGTWVDVVLEHGGCVTVMAQDKPTVDIELVTTTVSNMAEVRSYCYEASISDMASDSRCPTQGEA
YLDKQSDTQYVCKRTLVDRGWGNGCGLFGKGSLVTCAKFACSKKMTGKSIQPENLEYRIMLSVHGSQHSGMIVNDTGHET
...
The file "../data/5ire.fasta" contains the sequences of multiple chains of the Zika virus. Read the sequence of chain C from the file (the chain ids are given in the header, e.g. the chain above is A).
Find out which chains contain the subsequence LDFSDL.
Hints:
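One way to approach this is to collect each chain's lines into a single string keyed by chain id, since a subsequence may span a line break. The sketch below uses made-up FASTA text (the `XXXX` ids and sequences are hypothetical), and `read_fasta` is an illustrative helper name; it assumes headers follow the `>ID:CHAIN|...` pattern shown above.

```python
# Hypothetical FASTA text; ids and sequences are made up for illustration
sample_fasta = """\
>XXXX:A|PDBID|CHAIN|SEQUENCE
IRCIGVSNRD
FVEGMSGG
>XXXX:C|PDBID|CHAIN|SEQUENCE
AAALDF
SDLKK
"""


def read_fasta(text):
    """Return a dict mapping chain id to its full sequence."""
    sequences = {}
    chain = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header looks like ">ID:A|PDBID|..."; take the part between ':' and '|'
            chain = line.split("|")[0].split(":")[1]
            sequences[chain] = ""
        elif chain is not None:
            # Join continuation lines so searches can span line breaks
            sequences[chain] += line
    return sequences


sequences = read_fasta(sample_fasta)
print(sequences["C"])

# Chains whose joined sequence contains the subsequence
matches = [chain for chain, seq in sequences.items() if "LDFSDL" in seq]
print(matches)
```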
In [ ]: